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Abstract. For many classification and regression problems, a large number of features are available 
for possible use — this is typical of DNA microarray data on gene expression, for example. Often, 
for computational or other reasons, only a small subset of these features are selected for use in 
a model, based on some simple measure such as correlation with the response variable. This 
procedure may introduce an optimistic bias, however, in which the response variable appears to be 
more predictable than it actually is, because the high correlation of the selected features with the 
response may be partly or wholely due to chance. We show how this bias can be avoided when 
using a Bayesian model for the joint distribution of features and response. The crucial insight is 
that even if we forget the exact values of the unselected features, we should retain, and condition 
on, the knowledge that their correlation with the response was too small for them to be selected. In 
this paper we describe how this idea can be implemented for "naive Bayes" models of binary data. 
Experiments with simulated data confirm that this method avoids bias due to feature selection. 
We also apply the naive Bayes model to subsets of data relating gene expression to colon cancer, 
and find that correcting for bias from feature selection does improve predictive performance. 

1 Introduction 

Regression and classification problems that have a large number of available "features" (also known 
as "inputs" , "covariates" , or "predictor variables" ) are becoming increasingly common. Such prob- 
lems arise in many application areas. Data on the expression levels of tens of thousands of genes 
can now be obtained using DNA microarrays, and used for tasks such as classifying tumors. Docu- 
ment analysis may be based on counts of how often each word in a large dictionary occurs in each 
document. Commercial databases may contain hundreds of features describing each customer. 

Using all the features available in such problems is often infeasible. Using too many features 
can result in "overfitting" when simple statistical methods such as maximum likelihood are used, 
with the consequence that poor predictions are made for the response variable (e.g., the class) in 
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new items. More sophisticated Bayesian methods can avoid such statistical problems, but using a 
large number of features may still be undesirable. We will focus primarily on situations where the 
computational cost of looking at all features is too burdensome. Another issue in some applications 
is that using a model that looks at all features will require measuring all these features when making 
predictions for future items, which may sometimes be costly. In some situations, models using few 
features may be preferred because they are easier to interpret. 

For such reasons, modelers often use only a subset of features, chosen by some simple indicator 
of how useful they might be in predicting the response variable — see, for example, the papers 
in (Guyon, et al. 2006). For both regression problems with a real- valued response variable and 
classification problems with a binary (0/1) class variable, one suitable measure of how useful a 
feature may be is the sample correlation of the feature with the response. If the absolute value of 
this sample correlation is small, we might decide to omit the feature from our model. This criterion 
is not perfect, of course — it may result in a relevant feature being ignored if its relationship with 
the response is non-linear, and it may result in many redundant features being retained even when 
they all contain essentially the same information. Sample correlation is easily computed, however, 
and hence is an attractive criterion for screening a large number of features. 

Unfortunately, a model that uses only a subset of features, selected based on their high correlation 
with the response, will be optimistically biased — i.e., predictions made using the model will (on 
average) be more confident than is actually warranted. For example, we might find that the model 
predicts that certain items belong to class 1 with probability 90%, when in fact only 70% of these 
items are in class 1. In a situation where the class is actually completely unpredictable from the 
features, a model using a subset of features that purely by chance had high sample correlation with 
the class may produce highly confidence predictions that have less actual chance of being correct 
than just guessing the most common class. 

This optimistic bias comes from ignoring a basic principle of Bayesian inference — that we 
should base our conclusions on probabilities that are conditional on all the available information. 
If we have an appropriate model, this principle would lead us to use all the features. This would 
produce the best possible predictive performance. However, we assume here that computational or 
other pragmatic issues make using all features unattractive. When we therefore choose to "forget" 
some features, we can nevertheless still retain the information about how we selected the subset 
of features that we use in the model. Properly conditioning on this information when forming the 
posterior distribution eliminates the bias from feature selection, producing predictions that are as 
good as possible given the information in the selected features, without the overconfidence that 
comes from ignoring the feature selection process. 

In the next section, we describe this idea in more detail, and discuss the difficulties of imple- 
menting it. We then show how the idea can be applied to a simple "naive Bayes" classification 
model with binary features that are assumed to be independent given the value of the response 
variable. We apply this naive Bayes model to simulated data and to data regarding gene expres- 
sion in colon cancer, showing that bias correction does indeed improve predictions. Our method 
is more generally applicable, however. In the final section, we briefly discuss our work on mixture 
models for binary data and on factor analysis models for real- valued data, as well as other possible 
applications. 
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2 Our method for avoiding selection bias 



Suppose we wish to predict a response variable, y, based on the information in the numerical 
features xi, . . . , Xp, which we sometimes write as a vector, x. Our method is applicable both when 
y is a binary (0/1) class indicator, as is the case for the naive Bayes models discussed later, and 
when y is real-valued. We assume that we have complete data on n "training" cases, for which the 
responses are y^^\ ■ ■ ■ , y^"'^ (collectively written as t/*'""'") and the feature vectors are x^^\ . . . , a;^") 
(collectively written as a;'"'""). (Note that when y, x, or xt are used without a superscript, they 
will refer to some unspecified case.) We wish to predict the response for one or more "test" cases, 
for which we know only the feature vector. Our predictions will take the form of a distribution for 
y, rather than just a single- valued guess. 

We are interested in problems where the number of features, p, is quite big — perhaps as large as 
ten or a hundred thousand — and accordingly (for pragmatic reasons) we intend to select a subset 
of features based on the absolute value of each feature's sample correlation with the response. The 
sample correlation of the response with feature t is defined as follows (or as zero if the denominator 
below is zero): 

' n 

5; (yW - y) (xf ) - Xt) 
COR(y*^^'", xr'") = , - , = (1) 

where y = ^Yl V^^^ ~ n S ■ '^^^ numerator can be simplified to ^ (y^*) — y)xf\ 

i=l i=l i=l 

Although our interest is only in predicting the response, we assume that we have a model for 
the joint distribution of the response together with all the features. From such a joint distribution, 
with probability or density function P{y, xi, . . . , Xp), we can obtain the conditional distribution for 
y given any subset of features, for instance P{y | xi, . . . ,Xk), with k < p. This is the distribution 
we need in order to make predictions based on this subset. Note that selecting a subset of features 
makes sense only when the omitted features can be regarded as random, with some well-defined 
distribution given the features that are retained, since such a distribution is essential for these 
predictions be meaningful. This can be seen from the following expression: 

P{y\xi,...,Xk) 

j P{y I xi, . . . , Xfc, Xfc+i, . . . , Xp) P(xfc+i, . . . , Xp I xi, . . . , Xfc) dxk+i ■ ■ ■ dxp (2) 

If P(xfe+i, . . . , Xp I xi, . . . , Xfc) does not exist in any meaningful sense — as would be the case, for 
example, if the data were collected by an experimenter who just decided arbitrarily what to set 
Xfc+i, . . . , Xp to — then P{y | xi, . . . , Xk) will also have no meaning. 

Consequently, features that cannot usefully be regarded as random should always be retained. 
Our general method can accommodate such features, provided we use a model for the joint dis- 
tribution of the response together with the random features, conditional on given values for the 
non-random features. However, for simplicity, we will ignore the possible presence of non-random 
features in this paper. 
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We will assume that a subset of features is selected by fixing a threshold, 7, for the absolute 
value of the correlation of a selected feature with the response. We then omit feature t from 
the feature subset if |COR(y"''"", xf'")] < 7, retaining those features with a greater degree of 
correlation. Another possible procedure is to fix the number of features, k, that we wish to retain, 
and then choose the k features whose correlation with the response is greatest in absolute value, 
breaking any tie at random. If s is the retained feature with the weakest correlation with the 
response, we can set 7 to |COR(?/'"''"", x^"''"")!, and we will again know that if t is any omitted feature, 
|COR(y"''"", xf'")! < 7. If either the response or the features have continuous distributions, exact 
equality of sample correlations will have probability zero, and consequently this situation can be 
treated as equivalent to one in which we fixed 7 rather than k. If sample correlations for different 
features can be exactly equal, we should theoretically make use of the information that any possible 
tie was broken the way that it was, but ignoring this subtlety is unlikely to have any practical effect, 
since ties are still likely to be rare. 

Regardless of the exact procedure used to select features, we will denote the number of features 
retained by k, we will renumber the features so that the subset of retained features is xi, . . . ,x^, 
and we will assume we know that |COR(?/*'''"", xf '")| < 7 for t = k+1, . . . ,p. 

We can now state the basic principle behind our bias-avoidance method: When forming the 
posterior distribution for parameters of the model using a subset of features, we should condition 
not only on the values in the training set of the response and of the k features we retained, but 
also on the fact that the other p—k features have sample correlation with the response that is less 
than 7 in absolute value. That is, the posterior distribution should be conditional on the following 
information: 

y*^^'", x'{.r, |COR(y*^^'", xr'")| < iov t = k + 1, . . . ,p (3) 

where x^{!p = (xf'''", . . . , x^"'''"). 

We claim that this procedure of conditioning on the fact that selection occurred will eliminate 
the bias from feature selection. Here, "bias" does not refer to estimates for model parameters, but 
rather to our estimate of how well we can predict responses in test cases. Bias in this respect is 
also referred to as a lack of "calibration" — that is, the predictive probabilities do not represent 
the actual chances of events (Dawid 1982). If the model describes the actual data generation 
mechanism, and the actual values of the model parameters are indeed randomly chosen according 
to our prior, Bayesian inference always produces well-calibrated results, on average (with respect 
to the distribution of data and parameter values chosen from the prior). 

In justifying our claim that this procedure avoids selection bias, we will assume that our model for 
the joint distribution of the response and all features, and the prior we chose for it, are appropriate 
for the problem, and that we would therefore not see bias if we predicted the response using all 
the features. Now, imagine that rather than selecting a subset of features ourselves, after seeing 
all the data, we instead set up an automatic mechanism to do so, providing it with the value of 7 
to use as a threshold. This mechanism, which has access to all the data, will compute the sample 
correlations of all the features with the response, select the subset of features by comparing these 
sample correlations with 7, and then erase the values of the omitted features, delivering to us only 
the identities of the selected features and their values in the training cases. If we now condition on 
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all the information that we know, but not on the information that was available to the selection 
mechanism but not to us, we will obtain unbiased inferences. The information we know is just that 
of (3) above. 

Our method requires computation of an adjustment factor, P{S \ a, y"'"^"^), where a is the set of 
parameters whose likelihood needs adjusting, and S represents the information regarding selection, 
namely that |COR(y'™", '")| < 7 for t = k + 1, ■ ■ ■ ,p. Computing this factor is much easier if 
the xf'" are conditionally independent given a and y*''^'", since we can then write it as a product 
of factors pertaining to the various omitted features. For the models we consider, these factors are 
also all the same, since nothing distinguishes one omitted feature from another. We can then write 

p 

P{S I Q, y*'^'") = H P(|C0R(2/*^^'",xr'")| < 7 I «, v'^^n (4) 

t=k+l 

P(|COR(y"^=''",x^'^"'")| < 7|a, y"^"'")]^ (5) 

where in the second expression, t represents any of the omitted features. Note that in this expres- 
sion, y*'''''" is conditioned on, and hence considered fixed, whereas xf"" is random. Since the time 
needed to compute this adjustment factor does not depend on the number of omitted features, we 
may hope to save a large amount of computation time by omitting many features. 

Computing the single factor we do need is not trivial, however, since it involves integrals over 
both and any parameters specific to particular features. As we will see, however, efficient 

computation is possible for the naive Bayes model. 



3 Application to naive Bayes models with binary features 

In this section we show how to apply the bias correction method to Bayesian naive Bayes models in 
which both the features and the response are binary. Binary features are natural for some problems 
(e.g., test answers that are either correct or incorrect), or may result from thresholding real- valued 
features. Such thresholding can sometimes be beneficial — in a document classification problem, 
for example, whether or not a word is used at all may be more relevant to the class of the document 
than how many times it is used. Naive Bayes models assume that features are independent given the 
response. This assumption is often incorrect, but such simple naive Bayes models have nevertheless 
been found to work well for many practical problems. Here we show how to correct for selection 
bias in binary naive Bayes models, whose simplicity allows the required adjustment factor to be 
computed very quickly. Simulations reported in the next section show that substantial bias can be 
present with the uncorrected method, and that it is indeed corrected by conditioning on the fact 
that feature selection occurred. We then apply the method to real data on gene expression relating 
to colon cancer, and again find that our bias correction method improves predictions. 

3.1 Definition of the binary naive Bayes model 

Let a;^*) = (x^*\ • • • ,Xp^) be the vector of p binary features for case i, and let y^*^ be the binary 
response for case i, indicating the class. For example, y^*-* = 1 might indicates that cancer is 
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present for patient i, and y^*) = indicate that cancer is not present. Cases are assumed to be 
independent given the values of the model parameters (ie, exchangeable a priori). The probability 
that y = 1 in a case is given by the parameter ip. Conditional on the class y in some case (and 
on the model parameters), the features assumed to be independent, and to have 

Bernoulli distributions with parameters (py^i, ■ ■ ■ , <Py,p, collectively written as cf)y, with cf) = (<^q, (pi) 



representing all such parameters. In other words, the data is modeled as 

I -0 ~ Bernoulli (ip), for z = 1, . . . , n (6) 

Xj*'' I y^^\ cf) ~ Bernoulli {(pyiiij), for i = 1, . . . , n and j = I, . . . ,p (7) 

We use a hierarchical prior that expresses the possibility that some features may have almost 
the same distribution in the two classes. In detail, the prior has the following form: 

~ Beta (A, /o) (8) 

a ~ Inverse-Gamma(a, b) (9) 

9i,...,ep ™ Uniform(0,l) (10) 

4>o,j, (t)i,j I a, ™ Beta(Q6'j, 0(1-6*^)), for j = l,...,p (11) 



The hyperparameters 6 = {6i, . . . ,6p) are used to introduce dependence between (pQj and <^ij, 
with a controlling the degree of dependence. Features for which <j)Qj and (pij differ greatly are more 
relevant to predicting the response. When a is small, the variance of the Beta distribution in (11), 
which is 9j {l—6j) / (a+1), is large, and many features are likely to have predictive power, whereas 
when a is large, it is likely that most features will be of little use in predicting the response, since 
(poj and (pij are likely to be almost equal. We chose an Inverse-Gamma prior for a (with density 
function proportional to a~^^"*"") exp(— 6/a)) because it has a heavy upward tail, allowing for the 
possibility that a is large. Our method of correcting selection bias will have the effect of modifying 
the likelihood in a way that favors larger values for a than would result from ignoring the effect of 
selection. 

3.2 Integrating away ip and cp 

Although the above model is defined with ip and cf) parameters for better conceptual understanding, 
computations are simplified by integrating them way analytically. 

Integrating away ip, the joint probability of y*''^'" = (y^^\ . . . ,y^"^) is as follows, where /( • ) is 
the indicator function, equal to 1 if the enclosed condition is true and if it is false: 

Piy'-n = / r f^r/f V ^ - ^)^' ^ ^l-#-^' W (12) 
Jo luojii/ij 

/o, /i, E i{y^'^ = o), E i{y^'^ = i)) (i3) 
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The function U is defined as 

no rii 

TTtf f \ r(/o + /i) r(/o + no r(/i + m £=i 

r(/o)r(/i) r(/o + /i + no + ni) n (/o + /i + ^ - 1) 

The products above have the value one when the upper limits of ng or ni are zero. The joint 
probability of and the response, y* , for a test case is similar: 

(n n \ 

h,fi, E l{y'^'^=0) +l(y* = 0), E /(y« = l)+/(y* = l) (15) 
i=i i=i ^ 

Dividing P{y^'"^''\y*) by P^y^''^^'^) gives 

^'(y* I y'"") = Bernouni(y*;V') (16) 



Here, Bernoulh (y; V') = (l-V')^"^ and V; = (/i+A^i) / (/o + /i+n), with iVy = E /(y^^^ = y). 
Note that is just the posterior mean of ■0 based on y^^\ . . . , y^"^. 

Similarly, integrating over i;^>o,j and we find that 

1 

P{xf'- I ^„ a, y-^'") = n ^(«%' «(l-^.)> /,j) (17) 

where O^,,,- = E /(y(^) = y, xf = 0) and 1^,, = E I(y(^) = y, xf = 1). 

With ij) and integrated out, we need deal only with the remaining parameters, a and 6. 
Note that after eliminating ■0 and the the cases are no longer independent (though they are 
exchangeable). However, conditional on the responses, y*'^="", and on a, the values of different 
features are still independent. This is crucial to the efficiency of the computations described below. 

3.3 Predictions for test cases 

We first describe how to predict the class for a test case when we are either using all features, or 
using a subset of features without any attempt to correct for selection bias. We then consider how 
to make predictions using our method of correcting for selection bias. 

Suppose we wish to predict the response, y*, in a test case for which we know the retained 
features x\.^ = {x\,-- - ,ic^.) (having renumbered features as necessary). For this, we need the 
following predictive probability: 

P(y* I y"^"'") P(x* . I y*, x^'f", y"^"'") 
P{y*\xl,„xXt.y'-) = (18) 

E P{y* = y I y''"'") P{xli, \y* = y, , y*--) 

y=0 

le, we evaluate the numerator above for y* = and y* = 1, then divide by the sum to obtain the 
predictive probabilities. The first factor in the numerator, P{y* \ y"^"""), is given by equation (16). 
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It is sufficient to obtain the second factor up to a proportionality constant that doesn't depend on 
y*, as fohows: 

U/ ™train .tram \ 

This can be computed by integrating over a, noting that conditional on a the features are inde- 
pendent: 



p{xIj„ x^t I y*, y"^'") = J ^(«) P{<:k, <:r I «> y*, y"^'") da (20) 

= / ^(«) n ^(^i' ^r" I «' y"'''") da (21) 

Each factor in the product above is found by using equation (17) and integrating over Oj: 

P{x*, I a, y*, y'^^'") = [' P{x* \ 9,, a, x'f'\ y'^'", y*) P{xf'- \ Q,, a, y'^^'") d9j (22) 

Jo 

= / Bernouni(£c*;(^y. j) TT UiaOj, ail-Oj), lyj, Oyj)dej (23) 

where (py%j = {a9j + Iy%j) / (a + ^y), the posterior mean of (pyj given a and 6j. 

When using k features selected from a larger number, p, the predictions above, which are con- 
ditional on only x^"'.^'" and y*'^'", are not correct — we should also condition on the event, S, 
that |COR(y*"''"', rr*'"'''")! < 7 for j = A; -|- 1, . . . ,p. We need to modify the predictive probability 
of equation (18) by replacing P(£c*.^ | y*, x^{.p, with P{xl.j^ \ y*, xf^^"", y'"^'", S), which is 

proportional to P(a;J.^, ic'j^'.'^'", S \ y*, y"''""). Analogously to equations (20) and (21), we obtain 

Pixlf,, xlT, S I y*, y'-'") = y" P(a) Pixl,,, x^T, 5 | a, y*, y'^'") da (24) 

= f P{a) PiS I a, y'^^'") J] P(ic*, xf | a, y*, y'^^'") da (25) 

i=i 

The factors for the k retained features are computed as before, using equation (23). The additional 
correction factor that is needed (presented earlier as equation (5)) is 

p 

P{S I a, y*^"'") = Yl ^'(|C0R(y"^=''",x7'"")| < 7 | a, y*™) (26) 
j=k+i 

= P(|COR(y*™,xr'")| < 7 I a, y*'"'") ^ (27) 

where t is any of the omitted features, all of which have the same probability of having a small 
correlation with y. We discuss how to compute this adjustment factor in the next section. 

To see intuitively why this adjustment factor will correct for selection bias, recall that as dis- 
cussed in Section (3.1), when a is small, features will be more likely to have a strong relationship 
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with the response. If the hkehhood of a is based only on the selected features, which have shown 
high correlations with the response in the training dataset, it will favor values of a that are inap- 
propriately small. Multiplying by the adjustment factor, which favors larger values for a, undoes 
this bias. 

We compute the integrals over a in equations (21) and (25) by numerical quadrature. We 
use the midpoint rule, applied to u = F~^(a), where is the inverse cumulative distribution 
function for the Inverse-Gamma(a, 6) prior for a. The prior for u is uniform over (0,1), and so 
needn't be explicitly included in the integrand. With K points for the midpoint rule, the effect is 
that we average the value of the integrand, without the prior factor, for values of a that are the 
0.5/i^, 1.5/X, . . . , 1 — 0.5/i^ quantiles of its Inverse-Gamma prior. For each a, we use Simpson's 
Rule to compute the one-dimensional integrals over Qj in equation (23). 



3.4 Computation of the adjustment factor 

Our remaining task is to compute the adjustment factor of equation (27), which depends on the 
probability that a feature will have correlation less than 7 in absolute value. Computing this 
seems difficult — we need to sum the probabilities of tcf '"" given a and Qt over all con- 

figurations of a;^™" for which |COR(y*'''"", '")| < 7 — but the computation can be simplified 
by noticing that COR(xJ'''''", y'"""") can be written in terms of /q = XliLi -^(y^*^ ~ 0, xf' = 1) and 
h = YJl=iI{y^^ = 1> 4^ = 1)> as follows: 



(0 



COR(xr'", ?/'^"'") = , , (28) 



{0-y)Io + (l-y)/i 
Vny(l-y) V^o + A - (/o + Ii)V 



n 



(29) 



We write the above as Cor(/o, Ii,y), taking n as known. This function is defined for < /q < n{l—y) 
and < Ji < ny. 

Fixing n, y, and 7, we can define the following sets of values for Iq and Ii (for some feature Xt) 
in terms of the resulting correlation with y: 

(30) 
(31) 
(32) 
(33) 
(34) 

A feature will be discarded if (/q, Ii) € L- U Lq U and retained if {Iq, Ii) G U H^. These 
sets are illustrated in Figure 1. 





= {UO: 


Jl) 


: Cor(Io,Ii,y) = 0} 
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: 0<Cor(/o,/i,y) <7} 


L_ 


= {{h. 


^h) 


: -7<Cor(/o,/i,y)<0} 


H+ 


= {{h. 


^h) 


: 7 < Cor(/o,/i,y)} 




= {[h: 


.h) 


: Cor(/o,/i,|/) < -7} 



9 



14 


+1 


.00 


+0 


.90 


+0 


.81 


+0 


.72 


+0 


.62 


+0 


.53 


+0 


.42 


+0 


.29 




13 


+0 


.91 


+0 


.80 


+0 


.70 


+0 


.60 


+0 


.49 


+0 


.38 


+0 


.25 


+0 


.09 


-0 


.16 


12 


+0 


.83 


+0 


.72 


+0 


.61 


+0 


.50 


+0 


.39 


+0 


.27 


+0 


.13 


-0 


.03 


-0 


.24 


11 


+0 


.76 


+0 


.64 


+0 


.52 


+0 


.41 


+0 


.30 


+0 


.17 


+0 


.04 


-0 


.11 


-0 


.30 


10 


+0 


.69 


+0 


.57 


+0 


.45 


+0 


.33 


+0 


.21 


+0 


.09 


-0 


.04 


-0 


.18 


-0 


.36 


9 


+0 


.63 


+0 


.50 


+0 


.38 


+0 


.26 


+0 


.14 


+0 


.02 


-0 


.11 


-0 


.25 


-0 


.41 


8 


+0 


.57 


+0 


.44 


+0 


.31 


+0 


.19 


+0 


.07 


-0 


.05 


-0 


.18 


-0 


.31 


-0 


.46 


7 


+0 


.52 


+0 


.38 


+0 


.24 


+0 


.12 





.00 


-0 


.12 


-0 


.24 


-0 


.37 


-0 


.52 


6 


+0 


.46 


+0 


.31 


+0 


.18 


+0 


.05 


-0 


.07 


-0 


.19 


-0 


.31 


-0 


.44 


-0 


.57 


5 


+0 


.41 


+0 


.25 


+0 


.11 


-0 


.02 


-0 


.14 


-0 


.26 


-0 


.38 


-0 


.50 


-0 


.63 


4 


+0 


.36 


+0 


.18 


+0 


.04 


-0 


.09 


-0 


.21 


-0 


.33 


-0 


.45 


-0 


.57 


-0 


.69 


3 


+0 


.30 


+0 


.11 


-0 


.04 


-0 


.17 


-0 


.30 


-0 


.41 


-0 


.52 


-0 


.64 


-0 


.76 


2 


+0 


.24 


+0 


.03 


-0 


.13 


-0 


.27 


-0 


.39 


-0 


.50 


-0 


.61 


-0 


.72 


-0 


.83 


1 




--1 


.09 


-0 


.25 


-0 


.38 


-0 


.49 


-0 


.60 


-0 


.70 


-0 


.80 


-0 


.91 









.29 


-0 


.42 


-0 


.53 


-0 


.62 


-0 


.72 


-0 


.81 


-0 


.90 


-1 


.00 











1 




2 




3 




4 




5 




6 




7 




8 



Figure 1: The Cor function for a dataset with n = 22 and y = 14/22. The values of Cor(Io,/i,y) 
are shown for the vahd range of /q and Ii. Using 7 = 0.2, the values of (IqJi) in Lq are shown in 
dark grey, those in L_ or in medium grey, and those in i7_ or in light grey. 



We can write the probability needed in equation (27) using either L_, Lq, and L+ or and 
H^. We will take the latter approach here, as follows: 

P(|C0R(3;p'",?/*'-"'")| <7 I a, y*''"'") = 1 - P{{Io,Ii) £ H^UH+ \ a, y"-"'") (35) 

= 1 - ^ P(/o, h I a, y'-'") (36) 

(/o,-fi) G 

We can now exploit symmetries of the prior and of the Cor function to speed up computation. 
First, note that Cor(/o, -^1, y) = —Coic{n{l—y) — lQ,ny — Ii,y), as can be derived from equation (29), 
or by simply noting that swapping the feature values (0 and 1) should change only the sign of the 
correlation. The one-to-one mapping (Io,Ii) — > (n(l — y) — Io,ny — Ii), which maps and 
and vice versa (similarly for L_ and -L+), therefore leaves Cor unchanged. The priors for 6 and cf) 
(see (10) and (11)) are symmetrical with respect to the class labels and 1, so the prior probability 
of (/q, h) is the same as that of (n(l— y) — Iq, ny — Ii). We can therefore rewrite equation (36) as 

P( |COR(xr'",y'^^'")| < 7 I a, y'^^'") = 1 - 2 ^ P(/o, h \ a, y''^'") (37) 

At this point we write the probabilities for Iq and Ii in terms of an integral over 9^, and then 
swap the order of summation and integration, obtaining 

J2 Pih, h I a, y"^'") = f Yl ^(^0, h I a, Ou y"^^'") dOt (38) 
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The integral over 6t can be approximated using some one-dimensional numerical quadrature method 
(we use Simpson's Rule), provided we can evaluate the integrand. 

The sum over can easily be delineated because Cor(/o,/i,y) is a monotonically decreasing 
function of Jq, and a monotonically increasing function of Ji, as may be confirmed by differentiating 
with respect to Iq and Ii. Let 60 be the smallest value of Ji for which Cor(0,/i,y) > 7. Taking 
the ceiling of the solution of Cor(0,li,y) = 7, we find that 60 = [1/(1/'^ + (1 ~ y)/(^y7^))l- 
bo < h 1^ ny, let r/^ be the largest value of Iq for which Cor(/o, /i, y) > 7. We can write 

ny Ti^ 

P{h, h I a, Ot, y'-^'n = J2 ^(^0, h I a, 9t, y'-'") (39) 
(/o,A)e//+ i,=bo io=o 

Given a and 9t, Iq and Ii are independent, so we can reduce the computation needed by rewriting 
the above expression as follows: 

ny Ti^ 
(/o,/i)GH+ h=bo Io=0 

Note that the inner sum can be updated from one value of Ii to the next by just adding any 
additional terms needed. This calculation therefore requires 1 + ny — bo < n evaluations of 
P(/i I a, 9t, y'''''") and l+r„^ < n evaluations of P{Io \ a, 9t, y*''^'"). 

To compute P{Ii | ct, 6t, y"'=""), we multiply the probability of any particular value for 
in which there are Ii cases with y = 1 and xt = 1 by the number of ways this can occur. The 
probabilities are found by integrating over (/>o,t and (pi^t, as described in Section 3.2. The result is 



P{h I a, Ou y"^'") = ( "Jj^ J UiaOu a(l-^t), ny - h) (41) 



Similarly, 



P(/o I a, eu y"^'") = ( ""^^^^ ) U{aeu a(l-^t), Iq, n{l-y) - h) (42) 



One can easily derive simple expressions for P{Ii \ a, Ot, y"''''") and P{Iq \ a, Ot, y"''"") in terms 
of P{Ii — 1 I a, Oti y''''''") and P(/o — 1 I ^ij y''"""), which avoid the need to compute gamma 
functions or large products for each value of Iq or Ii when these values are used sequentially, as in 
equation (40). 



4 A Simulation Experiment 

In this section, we use a dataset generated from the naive Bayes model defined in Section 3.1 to 
demonstrate the lack of calibration that results when only a subset of features is used, without 
correcting for selection bias. We show that our bias-correction method eliminates this lack of cali- 
bration. We will also see that for the naive Bayes model only a small amount of extra computation 
time is needed to obtain the adjustment factor needed by our method. 
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Fixing a = 300, and p = 10000, we used equations (7), (10) and (11) to generate a set of 200 
training cases and a set of 2000 test cases, both having equal numbers of cases with y = and 
y = I. We then selected four subsets of features, containing 1, 10, 100, and 1000 features, based 
on the absolute values of the sample correlations of the features with y. The smallest correlation 
(in absolute value) of a selected feature with the class was 0.36, 0.27, 0.21, and 0.13 for these 
four subsets. These are the values of 7 used by the bias correction method when computing the 
adjustment factor of equation (27). Figure 2 shows the absolute value of the sample correlation 
in the training set of all 10000 features, plotted against the sample correlation in the test set. As 
can be seen, the high sample correlation of many selected features in the training set is partly 
or wholely a matter of chance, with the sample correlation in the test set (which is close to the 
real correlation) often being much less. The role of chance is further illustrated by the fact that 
the feature with highest sample correlation in the test set is not even in the top 1000 by sample 
correlation in the training set. 

For each number of selected features, we fit this data using the naive Bayes model with the 
prior for ip (equation (8)) having /o = /i = 1 and the prior for a (equation (9)) having shape 
parameter a = 0.5 and rate parameter 6 = 5. We then made predictions for the test cases using the 
methods described in Section 3.3. The "uncorrected" method, based on equation (18), makes no 
attempt to correct for the selection bias, whereas the "corrected" method, with the modification 
of equation (25), produces predictions that account for the procedure used to select the subset of 
features. We also made predictions using all 10000 features, for which bias correction is unnecessary. 

We compared the predictive performance of the corrected method with the uncorrected method 
in several ways. First, we looked at the error rate when classifying test cases by thresholding the 
predictive probabilities at 1/2. As can be seen in Figure 3, there is little difference in the error 
rates with and without correction for bias. However, the methods differ drastically in terms of the 
expected error rate — the error rate we would expect based on the predictive probabilities for the 
test cases, equal to (1/iV) Yli P^^ Kp^^ < 0-5) + {1-p^'^) I{p^^ > 0.5), where is the predictive 
probability of class 1 for test case i. The predictive probabilities produced by the uncorrected 
method would lead us to believe that we would have a much lower error rate than the actual 
performance. In contrast, the expected error rates based on the predictive probabilities produced 
using bias correction closely match the actual error rates. 

Two additional measures of predictive performance are shown in Figure 4. One measure of 
performance is minus the average log probability of the correct class in the N test cases, which is 
-(1/iV) Yl^^i [y(*Mog(p(*)) + (l-y(*))log(l-pW)]. This measure heavily penalizes test cases where 
the actual class has a predictive probability near zero. Another measure, less sensitive to such 
drastic errors, is the average squared error between the actual class (0 or 1) and the probability 
of class 1, given by (1/-/V) ^^^^(y^*'* — p^*))^. The corrected method outperforms the uncorrected 
method by both these measures, with the difference being greater for minus average log probability. 
Interestingly, performance of the uncorrected method actually gets worse when going from 1 feature 
to 10 features. This may be because the single feature with highest sample correlation with the 
response does have a strong relationship with the response (as may be likely in general), whereas 
some other of the top 10 features by sample correlation have little or no real relationship. 

We also looked in more detail at how well calibrated the predictive probabilities were. Table 1 
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Correlation in the training set 



Figure 2: The absolute value of the sample correlation of each feature with the binary response, 
in the training set, and in the test set. Each dot represents one of the 10000 binary features. The 
training set correlations of the 1st, 10th, 100th, and 1000th most correlated features are marked 
by vertical lines. 
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Uncorrected 
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number of features selected 
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10 100 1000 
number of features selected 



10000 



Figure 3: Actual and expected error rates with varying numbers of features selected, with and 
without correction for selection bias. The solid line is the actual error rate on test cases. The 
dotted line is the error rate that would be expected based on the predictive probabilities. 




d 1 1 1 g 1 1 1 ^ n 1 1 1 
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Figure 4: Performance in terms of average minus log probability and average squared error, and 
computation time, with varying numbers of features selected, with and without correction for 
selection bias. The left plot shows minus the average log probability of the correct class for test 
cases, with 1, 10, 100, 1000, and all 10000 features selected. The dashed line is with bias correction, 
the dotted line without. The middle plot is similar, but shows average squared error on test cases. 
The right plot shows the computation time needed for the methods (the two lines almost coincide). 
Note that when all 10000 features are used, there is no difference between the corrected and 
uncorrected methods. 



14 









1 feature selected out of 10000 


10 features selected out of 10000 








Coriccted 




Uncorrected 




Corrected 


Uncorrected 


Catcf 


rory 


# 


Pred Actual 


# 


Pred Actual 


# 


Pred 


Actual 


# 


Pred 


Actual 


0.0 - 


0.1 



















237 


0.046 


0.312 


0.1 - 


0.2 












3 


0.174 


0.000 


349 


0.149 


0.444 


0.2 - 


0.3 












126 


0.270 


0.294 


68 


0.249 


0.500 


0.3 - 


0.4 







1346 


0.384 0.461 


467 


0.360 


0.420 


300 


0.360 


0.443 


0.4 - 


0.5 


1346 


0.446 0.461 







566 


0.462 


0.461 


189 


0.443 


0.487 


0.5 - 


0.6 












461 


0.554 


0.566 


48 


0.546 


0.417 


0.6 - 


0.7 


654 


0.611 0.581 







276 


0.643 


0.616 


238 


0.650 


0.588 


0.7 - 


0.8 







654 


0.736 0.581 


97 


0.733 


0.742 


180 


0.737 


0.567 


0.8 - 


0.9 












4 


0.825 


0.750 


192 


0.864 


0.609 


0.9 - 


1.0 



















199 


0.943 


0.668 



100 features selected out of 10000 



# 

155 
247 
220 
225 
237 
227 
202 
214 
182 
91 



Corrected 
Pred Actual 



0.067 
0.151 
0.247 
0.352 
0.450 
0.545 
0.650 
0.749 
0.847 
0.935 



0.077 
0.162 
0.286 
0.356 
0.494 
0.586 
0.728 
0.785 
0.857 
0.923 



Uncorrected 
# Pred Actual 



717 
133 
70 
68 
58 
78 
77 
80 
98 
621 



0.017 
0.150 
0.251 
0.351 
0.451 
0.552 
0.654 
0.746 
0.852 
0.979 



0.199 
0.391 
0.429 
0.515 
0.500 
0.603 
0.532 
0.662 
0.633 
0.818 



1000 features selected out of 10000 



Corrected 
# Pred Actual 
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97 
63 
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81 
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0.349 
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0.560 
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Table 1: Comparison of calibration for predictions found with 
and without correction for selection bias, on data simulated 
from the binary naive Bayes model. Results are shown with 
four subsets of features and with the complete data (for which 
no correction is necessary). The test cases were divided into 
10 categories by the first decimal of the predictive probablity 
of class 1. The table shows the number of test cases in each 
category for each method ("#"), the average predictive prob- 
ability of class 1 for cases in that category ( "Pred" ) , and the 
actual fraction of these cases that were in class 1 ( "Actual" ) . 
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Figure 5: Posterior distributions of log(a) for the simulated data, with different numbers of features 
selected. The true value of log(a) is 5.7, shown by the vertical line. The solid line is the posterior 
density using all features. For each number of selected features, the dashed line is the posterior 
density including the factor that corrects for selection bias; the dotted line is the posterior density 
without bias correction. The dashed and solid lines overlap in the bottom two graphs. The dots 
mark the values of log(a) used to approximate the density, at the Q.b/K, 1.5/K, . . . , {K — 0.5)/ K 
quantiles of the prior distribution (where K = 30). The probabilities of x'"'"" at each of these 
values for a were computed, rescaled to sum to K, and finally multiplied by the Jacobian, aP{a), 
to obtain the approximation to the posterior density of log(a). 

shows the average predictive probability for class 1 and the actual fraction of cases in class 1 
for test cases grouped according to the first decimal of their predictive probabilities, for both 
the uncorrected and the corrected methods. Results are shown using subsets of 1, 10, 100, and 
1000 features, and using all features. We see that the uncorrected method produces overconfident 
predictive probabilities, either too close to zero or too close to one. The corrected method avoids 
such bias (the values for "Pred" and "Actual" are much closer), showing that it is well calibrated. 

The biased predictions of the uncorrected method result from an incorrect posterior distribution 
for a, as illustrated in Figure 5. Without bias correction, the posterior based on only the selected 
features incorrectly favours values of a smaller than the true value of 300. Multiplying by the 
adjustment factor corrects this bias in the posterior distribution. 

Our software (available from http : //www.utstat .utoronto . ca/'^longhai) is written in the 
R language, with some functions for intensive computations such as numerical integration and 
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computation of the adjustment factor written in C for speed. We approximated the integral with 
respect to a using the midpoint rule with A' = 30 values for F~^(ci)^ as discussed at the end of 
Section 3.3. The integrals with respect to Q in equations (23) and (38) were approximated using 
Simpson's Rule, evaluating Q at 21 points. 

Computation times for each method (on a 1.2 GHz UltraSPARC III processor) are shown on 
the right in Figure 4. The corrected method is almost as fast as the uncorrected method, since 
the time to compute the adjustment factor is negligible compared to the time spent computing the 
integrals over Oj for the selected features. Accordingly, considerable time can be saved by selecting 
a subset of features, rather than using all of them, without introducing an optimistic bias, though 
some accuracy in predictions may of course be lost when we discard the information contained in 
the unselected features. 

5 A test using gene expression data 

We also tested our method using a publicly available dataset on gene expression in normal and 
cancerous human colon tissue. This dataset contains the expression levels of 6500 genes in 40 
cancerous and 22 normal colon tissues, measured using the Affymetrix technology. The dataset 
is available at http : //geneexpression. cinj . org/'^notterman/af f yindex .html. We used only 
the 2000 genes with highest minimal intensity, as selected by Alon, Barkai, Notterman, Gish, Mack, 
and Levine (1999). In order to apply the binary naive Bayes model to the data, we transformed 
the real-value data into binary data by thresholding at the median, separately for each feature. 

We divided these 2000 genes randomly into 10 equal groups, producing 10 smaller datasets, each 
with 200 binary features, as well as the binary class (normal/cancerous). We applied the corrected 
and uncorrected methods separately to each of these 10 datasets, allowing some assessment of 
variability when comparing performance. For each of these 10 datasets, we used leave-one-out cross 
validation to obtain predictive probabilities for the class in the 62 cases. In this cross-validation 
procedure, we left out each of the 62 cases in turn, selected the five features with the largest sample 
correlation with the class (in absolute value), and found the predictive probability for the left-out 
case using the binary naive Bayes model, with and without bias correction. The absolute value of 
the correlation of the last selected feature with the class was always around 0.5. We used the same 
prior distribution, and the same computational methods, as for the demonstration in Section 4. 

Figure 6 plots the predictive probabilities of class 1 for all cases, with each of the 10 subsets of 
features. The tendency of the uncorrected method to produce more extreme probabilities (closer to 
and 1) is clear. However, when the predictive probability is close to 0.5, there is little difference 
between the corrected and uncorrected methods. Accordingly, the two methods usually classify 
cases the same way, if classification is done by thresholding the predictive probability at 0.5, and 
have very similar error rates. (The overall average error rate is 0.194 for the uncorrected method and 
0.182 for the corrected method.) Note, however, that correcting for bias would have a substantial 
effect if cases were classified by thresholding the predictive probability at some value other than 
0.5, as would be appropriate if the consequences of an error are different for the two classes. 

Figure 7 compares the two methods in terms of average minus log probability of the correct class 
and in terms of average squared error. From these plots it is clear that bias correction improves the 
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Figure 6: Scatterplots of the predictive probabilities of class 1 
for the 10 subsets drawn from the colon cancer gene expression 
data, with and without correction for selection bias. Black 
circles are cases that are actually in class 1 (cancer); hollow 
circles are cases that are actually in class 0. Note that many 
cases with predictive probabilities close to or 1 may overlap. 
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Figure 7: Scatterplots of the average minus log probability of the correct class and of the average 
squared error (assessed by cross validation) when using the 10 subsets of features for the colon 
cancer gene expression data, with and without correcting for selection bias. 
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Figure 8: Actual versus expected error rates on the colon cancer datasets, with and without bias 
correction. Points are shown for each of the 10 subsets of features used for testing. In the plot on 
the right, two points above the diagonal are almost superimposed. 
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predictive probabilities. In terms of average minus log probability, the corrected method is better 
for all 10 datasets, and in terms of average squared error, the corrected method is better for 8 out 
of 10 datasets. (A paired t test with these two measures produced p-values of 0.00007 and 0.019 
respectively.) 

Finally, Figure 8 shows that our bias correction method reduces optimistic bias in the predictions. 
For each of the 10 datasets, this plot shows the actual error rate (in the leave-one-out cross-validation 
assessment) and the error rate expected from the predictive probabilities. For all ten datasets, the 
expected error rate with the uncorrected method is substantially less than the actual error rate. 
This optimistic bias is reduced in the corrected method, though it is not eliminated entirely. The 
remaining bias presumably results from the failure in this dataset of the naive Bayes assumption 
that features are independent within a class. 

6 Conclusions and Future Work 

We have proposed a Bayesian method for making well-calibrated predictions for a response variable 
when using a subset of features selected from a larger number based on some measure of dependency 
between the feature and the response. Our method results from applying the basic principle that 
predictive probabilities should be conditional on all available information — in this case, including 
the information that some features were discarded because they appear weakly related to the re- 
sponse variable. This information can only be utilized when using a model for the joint distribution 
of the response and the features, even though we are interested only in the conditional distribution 
of the response given the features. 

We applied this method to naive Bayes models with binary features that are assumed to be 
independent conditional on the value of the binary response (class) variable. With these models, we 
can efficiently compute the adjustment factor needed to correct for selection bias. Crucially, we need 
only compute the probability that a single feature will exhibit low correlation with the response, and 
then raise this probability to the number of discarded features. When a large number of features 
are discarded, the time needed to compute the adjustment factor for bias correction is much less 
that the time that would have been needed to actually use these features. Substantial computation 
time can therefore be saved by discarding features that appear to have little relationship with the 
response. 

Our general method can be applied to other models and other feature selection criteria, provided 
that the adjustment factor can be computed. Reasonably efficient computation may be possible 
when the features for a case are independent given the values for a set of latent variables, since 
the adjustment factor can then again be found by raising the probability that a single feature will 
be discarded to the number of features that were discarded. However, since the values for latent 
variables will not be known, the computations are more difficult than for the naive Bayes model. 
Markov chain Monte Carlo methods will generally be needed to sample for the values of the latent 
variables. (They may be required in any case for models more complex than the binary naive Bayes 
model considered in this paper.) 

We have implemented such bias correction methods for two-component mixture models of binary 
data, and for factor analysis models, in which the features and the response are real valued. The 
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required computations are feasible, but slower and more complex than for the naive Bayes model. 
We will report the details of these methods and their performance in follow-on papers. The practical 
utility of the bias correction method we describe would be much improved if methods for more 
efhciently computing the required adjustment factor could be found, which could be applied to a 
wide class of models. 
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